Abstract. This work studies distributed algorithms for locally optimal load-balancing: We are
given a graph of maximum degree Δ, and each node has up to L units of load. The task is to
distribute the load more evenly so that the loads of adjacent nodes differ by at most 1.

If the graph is a path (Δ = 2), it is easy to solve the fractional version of the problem in
O(L) communication rounds, independently of the number of nodes. We show that this is tight,
and we show that it is also possible to solve the discrete version of the problem in O(L) rounds
in paths.

For the general case (Δ > 2), we show that fractional load balancing can be solved in
poly(L, Δ) rounds and discrete load balancing in f(L, Δ) rounds for some function f,
independently of the number of nodes.



1 Introduction 


In this work, we introduce the problem of locally optimal load balancing, and study it from the 
perspective of distributed algorithms. 

1.1 Locally optimal load balancing 

In this problem, we are given a graph G = (V, E), and each node has up to L units of load. The
task is to distribute load more evenly so that the loads of adjacent nodes differ by at most 1:


[Figure: an example input on a graph G and a corresponding output.]



That is, we want to smooth out the load distribution, and find an equilibrium in which no edge 
can improve its load distribution by selfishly moving load between its endpoints. 

A bit more formally, in the load balancing problem we are given an input vector x: V →
{0, 1, ..., L}, and the task is to find an output vector y: V → [0, L] and a flow f: E → ℝ so
that for each node v ∈ V we have

\[ y(v) = x(v) + \sum_{(u,v) \in E} f(u,v), \tag{1} \]

and for each edge (u, v) ∈ E we have

\[ |y(u) - y(v)| \le 1. \tag{2} \]

Here is an illustration of the input and a feasible solution in the special case that G is a path: 

x: 0 4 6 6 0 6 0 4 4 2 0 3

y: 3 3 4 4 3 3 3 3 3 2 2 2
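
On a path, conditions (1) and (2) are easy to check mechanically, because the flow over each edge is forced by the prefix sums of x − y. The following is a minimal sketch of such a check; the helper names `path_flow` and `is_locally_optimal` are ours, and we assume that nodes are indexed from left to right and that f[i] denotes the net load moved from node i to node i + 1.

```python
from itertools import accumulate

def path_flow(x, y):
    """Net flow f[i] from node i to node i+1 that turns loads x into loads y on a path."""
    if len(x) != len(y) or sum(x) != sum(y):
        raise ValueError("x and y must have equal length and equal total load")
    # Whatever nodes 0..i hold in excess of their outputs must cross edge {i, i+1}.
    surplus = [a - b for a, b in zip(x, y)]
    return list(accumulate(surplus))[:-1]

def is_locally_optimal(y):
    """Condition (2): the loads of adjacent nodes differ by at most 1."""
    return all(abs(a - b) <= 1 for a, b in zip(y, y[1:]))

# The example instance given above.
x = [0, 4, 6, 6, 0, 6, 0, 4, 4, 2, 0, 3]
y = [3, 3, 4, 4, 3, 3, 3, 3, 3, 2, 2, 2]
f = path_flow(x, y)
```

A negative entry of f means that load moves from right to left over that edge.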

The problem comes in two natural flavours: 

• Discrete load balancing: y(v) ∈ {0, 1, ..., L}, i.e., load units are indivisible.

• Fractional load balancing: y(v) ∈ [0, L], i.e., load units can be divided.

1.2 Centralised algorithms 

Both discrete and fractional load balancing can be solved easily with the following algorithm: Start
with y ← x and f ← 0. Then repeatedly pick an unhappy edge (u, v) ∈ E with y(u) ≥ y(v) + 2,
and move one unit of load from u to v. This algorithm clearly converges, as the potential
function Σ_v y(v)² decreases by at least 2 in each step.
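
The centralised algorithm can be sketched in a few lines. This is only an illustration under our own assumptions: the graph is given as an adjacency dictionary, the unhappy edge is chosen arbitrarily (the first one found), and the names `centralised_balance` and `potential` are hypothetical.

```python
from collections import defaultdict

def centralised_balance(adj, x):
    """adj: dict node -> list of neighbours; x: dict node -> integer input load."""
    y = dict(x)
    flow = defaultdict(int)  # flow[(u, v)] = net number of units moved from u to v
    while True:
        # Look for an unhappy edge (u, v) with y(u) >= y(v) + 2.
        unhappy = next(((u, v) for u in adj for v in adj[u] if y[u] >= y[v] + 2), None)
        if unhappy is None:
            return y, dict(flow)
        u, v = unhappy
        y[u] -= 1
        y[v] += 1
        flow[(u, v)] += 1
        flow[(v, u)] -= 1

def potential(y):
    """Sum of squared loads; it drops by at least 2 with every move."""
    return sum(load * load for load in y.values())

# Example: the 12-node path used in Section 1.1.
loads = [0, 4, 6, 6, 0, 6, 0, 4, 4, 2, 0, 3]
n = len(loads)
adj = {v: [w for w in (v - 1, v + 1) if 0 <= w < n] for v in range(n)}
x = dict(enumerate(loads))
y, flow = centralised_balance(adj, x)
```

The output depends on the order in which unhappy edges are picked, so it need not coincide with the feasible solution shown in Section 1.1.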

1.3 Local solutions and local algorithms

In the above centralised algorithm, we can think that each node v has a pile of y(v) tokens and
we always move the topmost token. Then the height of a token decreases by at least one every 
time we move it; hence no individual token is moved more than L times. This argument shows 
that there always exists a local solution in which the final position of a token is always within 
distance L from its origin; that is, each token can stay in its radius-L neighbourhood. 

In this work we are interested in whether the problem can be solved with a local algorithm: is it
possible to solve the problem so that we can compute the flow f(u, v) for each edge (u, v) ∈ E
based only on the information that is available within distance T from (u, v) in graph G, for
some T? Equivalently, we want to know if there is a (deterministic) distributed algorithm in
the usual LOCAL model that solves the load balancing problem in T communication rounds, or
more succinctly, in time T.

We will assume that the input graph has maximum degree Δ. We are interested in local
algorithms with a running time of T = T(L, Δ) that may depend on the maximum load L and
maximum degree Δ, but is independent of the number of nodes n = |V|. Such an algorithm
could be used to solve load balancing even in infinitely large graphs, and it would be very easy
to parallelise, as each part of the output can be determined based on its
local neighbourhood.

1.4 Smoothing with moving average 

There is a special case that can be easily solved with a local algorithm in time T = O(L):
fractional load balancing in 2-regular graphs (cycles and infinite paths). We can simply calculate
the moving average of the input loads with a window of size Θ(L). More concretely, each node
gives a fraction 1/(2L + 1) of its input load to every node (including itself) in its radius-L
neighbourhood. This way the final loads of adjacent nodes differ by at most L/(2L + 1) < 1/2
units. The same strategy can be applied easily in, e.g., d-dimensional grids.
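
The moving-average scheme is straightforward to write down. The sketch below is an illustration only: it works on a finite cycle (assumed to have at least 2L + 1 nodes, so that a radius-L neighbourhood does not wrap around onto itself), uses exact rational arithmetic, and the function name `moving_average_cycle` is ours.

```python
from fractions import Fraction

def moving_average_cycle(x, L):
    """Each node gives a 1/(2L+1) share of its load to every node within distance L."""
    n = len(x)
    share = Fraction(1, 2 * L + 1)
    y = [Fraction(0)] * n
    for u in range(n):
        for offset in range(-L, L + 1):
            y[(u + offset) % n] += x[u] * share
    return y

L = 6
x = [0, 4, 6, 6, 0, 6, 0, 4, 4, 2, 0, 3] * 2  # a 24-node cycle with loads in {0, ..., L}
n = len(x)
y = moving_average_cycle(x, L)
```

The outputs of two adjacent nodes differ only in the two inputs at the ends of their windows, which is where the bound L/(2L + 1) comes from.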

Among others, the present work seeks to answer the following questions:

• Is the running time of O(L) optimal here, or could we solve it in time o(L)?

• Can we generalise this kind of smoothing algorithm to arbitrary graphs, and if so, what is
the running time?

• Can we generalise this kind of smoothing algorithm to discrete load balancing?

1.5 Contributions 

The contributions of this work are as follows. We start with a simple lower bound: 

Theorem 1. Load balancing requires Ω(L) rounds, even in the case of paths and cycles.

Then we prove negative results for various algorithm families that have been used widely in
prior work. To this end, we define the following algorithm families:

• Match-and-balance algorithms: In each step, the algorithm finds a matching M and balances
the load (fully or partially) for each edge in M. More precisely, for each edge (u, v) ∈ M
with y(u) > y(v), the algorithm increases the flow f(u, v) by at most (y(u) − y(v))/2. For
example, many natural distributed versions of the centralised algorithm from Section 1.2
are of match-and-balance type.

• Careful algorithms: In each round, for each edge (u, v) ∈ E, the algorithm increases or
decreases f(u, v) by at most poly(L). All match-and-balance algorithms are also careful
algorithms.

• Oblivious algorithms: The total amount of load moved from node u to v only depends on
the initial load of u and the distance between u and v. For example, the moving average
algorithm from Section 1.4 is oblivious.

We show that algorithms of any of these types cannot find a locally optimal load balancing 
efficiently (or at all): 

Theorem 2. Any match-and-balance algorithm takes Ω(L²) rounds in the worst case, even in
paths and cycles.

Theorem 3. Any careful algorithm takes Δ^{Ω(L)} rounds in the worst case.

Theorem 4. There are no oblivious algorithms for infinite d-regular trees with d ≥ 3.

We then present the main contributions—local algorithms for load balancing. First, we show 
that we can circumvent the barrier of Theorem 2: 

Theorem 5. Discrete load balancing can be solved in time O(L) in paths and cycles, with a
deterministic local algorithm.

Corollary 6. The time complexity of both fractional and discrete load balancing in paths and
cycles is Θ(L).

Next we show that we can also circumvent the barriers of Theorems 3 and 4 for fractional
load balancing; naturally, we have to design an algorithm that is neither oblivious nor careful:

Theorem 7. Fractional load balancing can be solved in time poly(L, Δ) in bounded-degree graphs
with a deterministic local algorithm.

Finally, we show that discrete load balancing can be solved locally, i.e., in time that is
independent of n:

Theorem 8. Discrete load balancing can be solved in time T(L, Δ), for some function T, in
bounded-degree graphs with a deterministic local algorithm.

Whether there is an efficient algorithm for discrete load balancing in the general case remains 
an open question. 

2 Related work 

There is a vast body of literature related to problems that are superficially similar to locally 
optimal load balancing. However, in many cases the primary goal is something else—for example, 
achieving a near-optimal global solution—and the algorithms just happen to also find a locally 
optimal solution. 

Most of the previous solutions are inefficient. In particular, we are not aware of any solution
that comes close to O(L) for discrete load balancing on paths, or close to poly(L, Δ) for fractional
load balancing in general graphs. In prior work, the inefficiency typically stems from at least
one of the following factors: 

1. Inherently global problems: A lot of prior work focuses on problems that are inherently 
global—for example, the task is to find a solution such that the difference between the 
minimum load and the maximum load is at most 1. It is easy to see that any algorithm 
for solving such problems takes Ω(n) rounds in the worst case.

2. Natural but inefficient algorithms: Many papers study various natural processes for doing 
load balancing. Many of these are of match-and-balance type, and virtually all of these 
are careful. Typically, the negative results of Theorems 2 and 3 apply. 

In contrast, we study a problem that can be solved efficiently, and our algorithms demonstrate 
that it is indeed possible to break the barriers of Theorems 2 and 3. In what follows, we will 
discuss related work in more detail. 

Reducing a global potential with local rules. There is a lot of literature on load balancing
when the goal is to reduce a global potential function by iterating a local balancing rule. Examples
of such potential functions are the difference between the maximum and the minimum load
(discrepancy), the maximum load (makespan), and the quadratic difference to the average load.

Various models are considered: two classic models are the diffusion model, where vertices 
distribute their load to all their neighbours, and the matching model, where the load is exchanged 
only along the edges of a matching—for example a random matching or an edge colouring. 

In the continuous case, where the loads are assumed to be infinitely divisible, the speed
of convergence was analysed for simple schemes both in the diffusion model [21, 23] and the
matching model [6, 11]. In both, the speed of convergence is essentially captured by the spectral
properties of the graph in question.

In the context of indivisible loads, known as the discrete case, similar problems were first
studied for networks designed to balance the load quickly [20]. Different schemes for reducing the
discrepancy in the discrete case were analysed, but the question of whether the speed of convergence
in the continuous case could be matched remained open [1, 11, 12, 19]. Recently, Sauerwald and
Sun [22] were able to prove convergence as fast as in the continuous case, up to constant factors.
Reducing discrepancy is a global problem and can take linear time in the worst case.

Semi-matching problem. In the semi-matching problem the nodes of a graph are divided 
into clients and servers [14]. Each client has to be assigned to an adjacent server. The goal is to 
optimise the total waiting time of the clients. 

Czygrinow et al. [8] presented a distributed algorithm for finding a locally optimal
semi-matching in time poly(Δ); this also implies a factor-2 approximation of globally optimal
semi-matchings.

The semi-matching problem is very similar to the locally optimal load balancing problem,
especially when limited to the case of degree 2 clients, with the tokens being more “localised”.
Indeed, our linear lower bound can be adapted to prove an Ω(Δ) lower bound for locally optimal
semi-matchings.

Balls into bins. In the d-choice process each of n balls goes in the least loaded of d random 
bins. Dependency of the maximum load on the parameter d is well known [3, 16, 24]. The choice 
of the bins can be modelled by a graph [17]; in one variant the bins are connected by edges 
and each ball does a local search until it finds a local minimum [5, 7]. This process produces a 
locally optimal load balancing. 

Sandpile models and chip-firing games. Our stability condition is similar to what is used 
in sandpile models [4, 9, 15] and chip-firing games [2]. However, in these problems the goal is 
usually to describe final configurations for fixed, very simple algorithms that simulate a natural 
phenomenon. 

Filtering. Sliding window algorithms for computing the running average or for image filtering 
are natural local algorithms. Averaging type algorithms, however, cannot guarantee an integral 
solution to load balancing problems. Median filtering does guarantee integral solutions for 
integral inputs; however, it does not preserve the total load. 

Games and equilibriums. The locally optimal load balancing problem can be seen as a 
problem of finding an equilibrium state, where no single load token can gain advantage by 
moving. We show that such an equilibrium can be found locally, that is, the decisions made in 
one part of the graph do not propagate too far. This is in contrast with problems such as finding 
stable matchings, where there is a local algorithm only for finding almost-stable matchings [10]. 

Matchings. Locally optimal load balancing is closely related to bipartite maximal matching: if
the initial loads are x(v) ∈ {0, 2}, then it is easy to see that a solution can be found using a
bipartite maximal matching algorithm. This is a problem that can be solved in time O(Δ) [13].
Showing a matching lower bound is a major open question, and we do not expect that one can
prove tight lower bounds for locally optimal load balancing as a function of Δ before we resolve
the distributed time complexity of bipartite maximal matching.

In our algorithms for discrete load balancing, we will use the bipartite maximal matching [13] 
algorithm as a subroutine. For fractional load balancing, we use the almost-maximal fractional 
matching algorithm due to Khuller et al. [18] as a subroutine. 

3 Negative results 

We will now prove the negative results of Theorems 1-4. For simplicity, we prove the statements 
for deterministic distributed algorithms; it is fairly straightforward to extend the results to 
randomised algorithms (e.g., consider the expected values of the outputs). 

Recall that in Section 1.1 we defined the problem so that the output is bounded by L.
However, we will not exploit this restriction in any of the lower-bound proofs. The negative 
results hold verbatim for a relaxed version of the problem in which the outputs can be any 
nonnegative real numbers. We only assume that the inputs are bounded by L. 

3.1 Load balancing on paths and cycles 

We start with the unconditional lower bound that holds for any algorithm, for both fractional 
and discrete load balancing, and in the simplest possible case of paths or cycles. 

Theorem 1. Load balancing requires Ω(L) rounds, even in the case of paths and cycles.

Proof. We will give the proof for the case of paths; the case of cycles is very similar. Consider a
path P with n nodes, labelled with the numbers 1, 2, ..., n from left to right, for a sufficiently
large n. Let A be a load-balancing algorithm. For an input x: V → {0, 1, ..., L}, we write A(x) for the
output of A on input x. Let h = ⌊L/2⌋ − 1.

Consider the following constant inputs: x_0: v ↦ 0 and x_L: v ↦ L. Let y_0 = A(x_0) and
y_L = A(x_L). Clearly y_0(v) = 0 for all v and y_L(v) ≥ L for at least one v. Hence we can find two
nodes, ℓ and r, such that

\[ y_0(\ell) = 0, \quad y_L(r) \ge L, \quad |r - \ell| = L - 1. \]

See Figure 1 for an illustration.

W.l.o.g., assume that ℓ < r. Let m = (r + ℓ)/2 be the midpoint between ℓ and r. Now
define an input x such that x(i) = 0 for i < m and x(i) = L otherwise. Note that the radius-h
neighbourhoods of ℓ are identical in x_0 and x. Similarly, the radius-h neighbourhoods of r are
identical in x_L and x.

Let y = A(x). If y(ℓ) = y_0(ℓ) and y(r) = y_L(r), we have a contradiction: the distance
between ℓ and r is smaller than their load difference, and hence there has to be an unhappy
edge between them. Therefore y(ℓ) ≠ y_0(ℓ) or y(r) ≠ y_L(r). In both cases, there is a node
that changed its output between two instances, even though the inputs were identical up to
distance h. Hence the running time of A has to be at least h + 1 = Ω(L). □
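
The indistinguishability step of the proof can be checked on a concrete instance. The sketch below is illustrative only: it uses 0-based node labels instead of 1, ..., n, and the helper names `step_input` and `same_view` are ours.

```python
def step_input(n, L, left, right):
    """Input x with x(i) = 0 for i < m and x(i) = L otherwise, where m = (left + right) / 2."""
    m = (left + right) / 2
    return [0 if i < m else L for i in range(n)]

def same_view(a, b, centre, radius):
    """True if inputs a and b agree on the radius-`radius` neighbourhood of `centre`."""
    lo, hi = centre - radius, centre + radius
    assert 0 <= lo and hi < len(a), "neighbourhood must fit inside the path"
    return a[lo:hi + 1] == b[lo:hi + 1]

# The parameters of Figure 1: L = 4, hence h = 1.
L, n = 4, 20
h = L // 2 - 1
left, right = 8, 8 + L - 1          # two nodes at distance L - 1
x0, xL = [0] * n, [L] * n           # the two constant inputs
x = step_input(n, L, left, right)   # the hybrid input
```

With radius h the two views coincide with the constant inputs, while with radius h + 1 the left node already sees a difference, so this construction cannot be pushed any further.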

3.2 Match-and-balance algorithms 

Recall that in each round, a match-and-balance algorithm finds some matching M, and then
for each edge (u, v) ∈ M with y(u) > y(v), the algorithm increases the flow f(u, v) by at most
(y(u) − y(v))/2. Note that M does not need to be a maximal matching, a maximum matching,
or a random matching; the following lower bound holds regardless of how clever the algorithm
tries to be in its selection of the matching M, and even if it gets the matchings in zero time
from an oracle.

Figure 1: The proof of Theorem 1 in Section 3.1. In this example, L = 4 and h = 1. We can find
a node r with output at least L in y_L, and a node ℓ with output 0 in y_0 so that the distance
between ℓ and r is L − 1. Then we construct an instance x that looks like x_L in the h-neighbourhood
of r and like x_0 in the h-neighbourhood of ℓ. If node r does not change its output
between y_L and y, then node ℓ has to change its output between y_0 and y. Hence the running
time is at least h + 1.

Theorem 2. Any match-and-balance algorithm takes Ω(L²) rounds in the worst case, even in
paths and cycles.

The basic idea of the proof is simple. Let A be a match-and-balance algorithm.

1. We construct an instance in which A has to move Ω(L³) units of load in total.

2. We prove that A can move only O(L) units of load per round.

Hence we have a lower bound of Ω(L²) for the running time of A.

We will again study the case of paths; the case of cycles is very similar. Let P be a path
with 2n + 1 nodes, labelled with −n, −n + 1, ..., n from left to right. We say that a load vector
is monotone if y(i) ≥ y(j) for all i < j; see Figure 2. The key feature of match-and-balance
algorithms is that a monotone load vector remains monotone after each step.

Lemma 9. Match-and-balance algorithms maintain a monotone load configuration on P.

Figure 2: The proof of Lemma 9 in Section 3.2. In this example, L = 8 and the load distribution
is monotone. A match-and-balance algorithm can only move at most L/2 = 4 units of load (the
highlighted tokens).

Proof. Assume that the current load configuration y is monotone. Let M be a matching and
let y' be the load configuration after balancing over M. Consider nodes i and i + 1. Initially
y(i) ≥ y(i + 1); we will prove by a case analysis that y'(i) ≥ y'(i + 1):

1. {i, i + 1} ∈ M: we will have y'(i) ≥ y'(i + 1).

2. {i, i + 1} ∉ M:

• {i, i − 1} ∉ M: we will have y'(i) = y(i).

• {i, i − 1} ∈ M: we will have y'(i) ≥ y(i).

• {i + 1, i + 2} ∉ M: we will have y'(i + 1) = y(i + 1).

• {i + 1, i + 2} ∈ M: we will have y'(i + 1) ≤ y(i + 1).

In each case y'(i) ≥ y'(i + 1). □

In a monotone configuration, we can only move O(L) units of load per round; see Figure 2.

Lemma 10. Any match-and-balance algorithm A can move at most L/2 units of load in a single
round on path P with a monotone load configuration.

Proof. Since A maintains a monotone load configuration, the sum of the load differences over all
edges is at most L. Therefore even if M contains all edges with a non-zero load difference, the
algorithm can move only at most L/2 units of load per round in total. □
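
Lemmas 9 and 10 can be illustrated with a single match-and-balance round on a small monotone configuration. This is a sketch under our own conventions, not a particular algorithm from the literature: a matching is given as a list of indices i, each standing for the edge {i, i + 1}, and the names `match_and_balance_round` and `is_monotone` are hypothetical.

```python
from fractions import Fraction

def match_and_balance_round(y, matching, share=Fraction(1, 2)):
    """One round on a path; each matched edge moves share * (load difference).

    The definition of match-and-balance algorithms allows any share <= 1/2.
    """
    chosen = set(matching)
    assert all(i + 1 not in chosen for i in chosen), "edges of a matching must be disjoint"
    assert 0 <= share <= Fraction(1, 2)
    y = [Fraction(v) for v in y]
    moved = Fraction(0)
    for i in chosen:
        heavy, light = (i, i + 1) if y[i] >= y[i + 1] else (i + 1, i)
        amount = share * (y[heavy] - y[light])
        y[heavy] -= amount
        y[light] += amount
        moved += amount
    return y, moved

def is_monotone(y):
    return all(a >= b for a, b in zip(y, y[1:]))

L = 8
y0 = [8, 8, 6, 5, 5, 2, 0, 0]                     # monotone, loads in {0, ..., L}
even_edges, odd_edges = [0, 2, 4, 6], [1, 3, 5]   # two matchings covering all edges
y_even, moved_even = match_and_balance_round(y0, even_edges)
y_odd, moved_odd = match_and_balance_round(y0, odd_edges)
```

In both cases the configuration stays monotone and the total load moved is 2, well within the bound L/2 = 4.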

Proof of Theorem 2. We will consider the input vector x where x(i) = L for i ≤ 0 and x(i) = 0
otherwise. The vector is monotone and hence it remains monotone throughout the execution
of A. Consider the output of node 0. There are two cases; see Figure 3:

(a) The output of node 0 is at most h = L/2. Now for each i = 0, 1, ..., h − 1, we can observe
that the load of node −i has decreased by at least h − i units, and by monotonicity, all of this
load has been moved to the right. In particular, for each i we have moved h − i units of load
from node −i over at least i + 1 edges. The total amount of work done by the nonpositive
nodes is at least the tetrahedral number 1 · h + 2 · (h − 1) + ... + h · 1 = Θ(h³) = Θ(L³).

(b) The output of node 0 is at least h = L/2. Now for each i = 0, 1, ..., h − 1, we can observe
that the load of node i is at least h − i units, and by monotonicity, all of
this load has been moved from the left. The total amount of work done by the nonnegative
nodes is again Ω(L³).

By Lemma 10, moving Ω(L³) units of load takes Ω(L²) rounds. □
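
For reference, the work bound used in the proof can be written out in closed form. The identity below is the standard formula for tetrahedral numbers, and the final division uses the per-round bound of L/2 from Lemma 10.

```latex
\[
  \sum_{i=0}^{h-1} (i+1)(h-i)
  \;=\; \frac{h(h+1)(h+2)}{6}
  \;=\; \Theta(h^3) \;=\; \Theta(L^3),
  \qquad
  \frac{\Omega(L^3)}{L/2} \;=\; \Omega(L^2).
\]
```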

3.3 Careful algorithms 

Recall that careful algorithms move at most poly(L) units of load over each edge per round. This
includes, for example, all match-and-balance algorithms, as well as many other natural algorithms
that simulate the physical process of collapsing piles of tokens.

Figure 3: The proof of Theorem 2 in Section 3.2. We construct input x, run any match-and-balance
algorithm, and have a case analysis based on the output of node 0: (a) The load of node
0 decreases by at least h, and the nonpositive nodes do Ω(h³) units of work in total to push
load to the right. (b) The load of node 0 is still at least h, and the nonnegative nodes do Ω(h³)
units of work in total to pull load from the left.

Theorem 3. Any careful algorithm takes Δ^{Ω(L)} rounds in the worst case.

Proof. Construct the input (G, x) as shown in Figure 4: We have a tree G_u rooted at u, a tree
G_v rooted at v, plus an edge {u, v}. Both trees are of depth L/4; each non-leaf node has d − 1
children. All nodes of G_u have an input load of 0, and all nodes of G_v have an input load of L.

Now consider any solution (y, f). If y(u) ≥ L/4, then all nodes of G_u have a load of at least
1, and there are d^{Ω(L)} nodes in G_u. All of the load has been moved across the edge {u, v}, and
hence f(v, u) = d^{Ω(L)}. Otherwise y(u) < L/4, and y(v) < L/4 + 1. In this case all nodes of G_v
have a load of at most L − 1, and again we can conclude that f(v, u) = d^{Ω(L)}.

A careful algorithm starts with y ← x and f ← 0 and changes each element of f by at most
poly(L) in each round. Hence any careful algorithm has to spend d^{Ω(L)} rounds for this instance. □

3.4 Oblivious algorithms 

Recall that in an oblivious algorithm, the total amount of load moved from node u to v only
depends on the initial load of u and the distance between u and v. For example, the algorithm
that computes the moving average in an infinite path is an oblivious algorithm. We show that
such algorithms do not exist for infinite regular trees of a degree larger than 2.

Figure 4: The proof of Theorem 3 in Section 3.3. In this example, d = 3 and L = 12.

Theorem 4. There are no oblivious algorithms for infinite d-regular trees with d ≥ 3.

Proof. We say that a node is full if it has a load of L, and empty if it has a load of 0. We say
that a subtree is full if all nodes in it are full, and a subtree is empty if all nodes in it are empty.
Construct the input (G, x) as shown in Figure 5a:

• G is the d-regular infinite tree,

• {u, v} is an edge of G,

• each node w that is closer to u than v is empty,

• each node w that is closer to v than u is full.

We will consider the infinite tree G rooted at either u or v. If we root it at u, then u is adjacent
to 1 full subtree and d − 1 empty subtrees. If we root it at v, then v is adjacent to d − 1 full
subtrees and 1 empty subtree. See Figure 5a for illustrations.

Let g(r) be the amount of load that the oblivious algorithm moves from a full node w to any
node that is at distance r from w. Define the shorthand notation

\[ \alpha = \sum_{r=0}^{\infty} (d-1)^r \, g(r+1), \]

which has two equivalent interpretations in rooted infinite d-regular trees:

• a full root node sends in total α units of load to each subtree,

• the root node receives α units from each full subtree.

See Figure 5b. In total, a full node gives dα units of load to other nodes, so it leaves

\[ \beta = L - d\alpha \]

units of load for itself. It is easy to verify that we must have β ≥ 0 and hence α ≤ L/d; otherwise
there would be inputs with negative outputs.

Now we are ready to put the pieces together. Node u receives α units of load from its only
full subtree, while v receives (d − 1)α units of load from its d − 1 full subtrees; moreover, v leaves
β units of load for itself. The load difference between u and v is therefore at least

\[ (d-1)\alpha + \beta - \alpha = L - 2\alpha \ge \frac{d-2}{d}\, L. \]

Hence for any d ≥ 3 and for a sufficiently large L, edge {u, v} will be unhappy in G, no matter
which oblivious algorithm we apply. □
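
Spelled out step by step, the final bound uses only the definition β = L − dα and the constraint α ≤ L/d:

```latex
\[
  (d-1)\alpha + \beta - \alpha
  \;=\; (d-2)\alpha + (L - d\alpha)
  \;=\; L - 2\alpha
  \;\ge\; L - \frac{2L}{d}
  \;=\; \frac{d-2}{d}\,L .
\]
```

For d ≥ 3 the right-hand side is at least L/3, which exceeds 1 as soon as L > 3.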

Figure 5: The proof of Theorem 4 in Section 3.4. In this example, d = 3. (a) The input graph G
is an infinite d-regular tree rooted at {u, v}; one half has an input load of 0, and the other half
has an input load of L. (b) In the output, each full subtree contributes α units of load, and the
node itself contributes β units of load.

4 Positive results 

We will now prove the positive results of Theorems 5, 7, and 8. 

4.1 Discrete load balancing in paths and cycles 

We first give an algorithm that exactly matches the lower bound of Theorem 1. 

Theorem 5. Discrete load balancing can be solved in time O(L) in paths and cycles, with a
deterministic local algorithm.

Infinite directed paths. We will first show how to do load balancing in an infinite path with
a consistent orientation. That is, each node v has a degree of 2, and it can refer to its left
neighbour v − 1 and right neighbour v + 1 in a globally consistent manner.

We will interpret the path with tokens as a 2-dimensional grid, indexed by (v, i), where
v ∈ V is a node and i ∈ {1, ..., L} is a possible location for a token. We say that (v, i) is a slot.
Initially, slot (v, i) holds a token if x(v) ≥ i. Our plan is to move the tokens around in the grid
so that we maintain the following stability conditions; see Figure 6 for an illustration.

Definition 1. A token in slot (v, i) is k-stable if i = 1 or there is a token in slot (v + k, i − 1).
A configuration is k-stable if all tokens are k-stable. For a set K, a configuration is K-stable if
it is k-stable for all k ∈ K.

We write [[a, b]] = {a, a + 1, ..., b}. Initially, the configuration is 0-stable. If we can find a
[[−1, 1]]-stable configuration, we can construct a feasible solution to the load balancing problem
by simply setting y(v) to be equal to the number of tokens in slots (v, ·).
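
Definition 1 translates directly into a membership test on the set of occupied slots. The following sketch is for illustration only; it replaces the infinite path by a cycle with n nodes (so that node indices wrap around modulo n), and the names `initial_tokens`, `is_k_stable` and `loads` are ours.

```python
def initial_tokens(x):
    """Occupied slots of the initial configuration: slot (v, i) holds a token iff x[v] >= i."""
    return {(v, i) for v, load in enumerate(x) for i in range(1, load + 1)}

def is_k_stable(tokens, n, k):
    """Every token (v, i) has i == 1 or a token in slot (v + k, i - 1), indices modulo n."""
    return all(i == 1 or ((v + k) % n, i - 1) in tokens for (v, i) in tokens)

def loads(tokens, n):
    """y(v) = number of tokens in slots (v, .)."""
    y = [0] * n
    for v, _ in tokens:
        y[v] += 1
    return y

balanced = initial_tokens([1, 2, 3, 2, 1, 1])  # adjacent loads differ by at most 1
spiky = initial_tokens([0, 4, 0, 0])           # one tall pile
```

Both configurations are 0-stable, as every initial configuration is; only the first one is also 1-stable and (−1)-stable, and it fails to be 2-stable.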

Figure 6: Stability. Token (s, 3) is 2-stable, as there exists a token in slot (u, 2), where u = s + 2,
i.e., the node 2 steps right from s. Also (s, 2) and (s, 1) are 2-stable. However, this configuration
is not 2-stable: token (t, 3) is not 2-stable, as there is an empty slot (v, 2). It can be verified
that the configuration is 0-stable, 1-stable, and (−1)-stable.


However, we will now design an O(L)-time algorithm with a stronger stability condition: it
will compute a [[−3, 3]]-stable configuration. Informally, we smooth out the load distribution
so that the slope of the load curve is at most 1/3. This extra slack will be helpful when we
eventually want to solve the problem in paths without consistent orientations.

This algorithm is based on the concept of pushes. For a node v and integer ℓ, define the
ℓ-diagonal of v as the following list of slots (see Figure 7):

\[ S(v, \ell) = \bigl( (v - \ell, 1),\, (v - 2\ell, 2),\, \ldots,\, (v - L\ell, L) \bigr). \]

In an ℓ-push we redistribute the tokens in each S(v, ℓ): if there are k tokens in S(v, ℓ), then we
redistribute the tokens so that the first k elements of S(v, ℓ) are occupied and the remaining
L − k elements are empty (see Figure 7). In essence, we let the tokens slide along each diagonal
so that they are piled on the bottom of each diagonal.

Figure 7: Pushing. The 1-diagonal of v is highlighted. In a 1-push, we redistribute the tokens in
each 1-diagonal so the end result will be a 1-stable configuration. Note that this configuration
was already 0-stable and (−1)-stable, and it remained 0-stable and (−1)-stable after a 1-push.
In general, whatever stability we have already achieved by pushing is never lost in subsequent
pushes.

An ℓ-push can be efficiently implemented in time O(ℓL) with a distributed algorithm: for
example, node v is responsible for redistributing the tokens in slots S(v, ℓ), and we first use
O(ℓL) rounds so that each node v can discover everything related to S(v, ℓ), and then another
O(ℓL) rounds so that node v can inform the relevant nodes regarding how to move tokens in
S(v, ℓ).

Clearly, after an ℓ-push we will have an ℓ-stable configuration. The non-trivial part is that
ℓ-pushes do not interfere with any stability that we have previously achieved.

Lemma 11. For every choice of integers ℓ and k, if a configuration is k-stable, then it is still
k-stable after an ℓ-push.

Proof. The case k = ℓ is trivial; hence we assume that k ≠ ℓ. Consider slots a = (v, i) and
b = (v + k, i − 1); see Figure 8. We need to argue that if a holds a token after an ℓ-push, then b
will also hold a token after an ℓ-push. To this end, let A be the ℓ-diagonal that contains a, and
let B be the ℓ-diagonal that contains b. Now by definition, after an ℓ-push, slot a is occupied if
and only if there were at least i tokens in A.

Figure 8: The proof of Lemma 11, for ℓ = −2 and k = 1. The configuration was 1-stable, i.e.,
the gray 1-diagonals are filled starting from the bottom. Now we do a (−2)-push, and want to argue
that the configuration will still be 1-stable. Consider slot a. It will be filled iff there are at least
5 tokens in the (−2)-diagonal A. But this implies that there are at least 4 tokens in the diagonal
B, and hence the slot b will be filled, too.

The key observation is that k-stability implies that for every token in A there is a token in B,
with the exception of the first token: if (u, j) ∈ A holds a token and j > 1 then (u + k, j − 1) ∈ B
holds a token as well. In particular, if there were at least i tokens in A, there were at least i − 1
tokens in B, and hence b will also hold a token. □

Now we can easily find a [[−3, 3]]-stable configuration in time O(L): the algorithm simply does
an ℓ-push for each ℓ ∈ [[−3, 3]], sequentially, in an arbitrary order. We will call this algorithm A_1.
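
The pushes and the resulting algorithm can be simulated centrally. The sketch below is only an illustration, not the distributed implementation: it runs on a directed cycle with n nodes instead of an infinite path (node indices are taken modulo n), it represents a configuration as a set of occupied slots (v, i), and the function names `push` and `algorithm_a1` are ours.

```python
def push(tokens, n, ell):
    """ell-push: let the tokens of every ell-diagonal slide to the bottom of that diagonal."""
    counts = {}
    for (w, i) in tokens:
        v = (w + i * ell) % n          # slot (w, i) lies on the diagonal S(v, ell)
        counts[v] = counts.get(v, 0) + 1
    # A diagonal with k tokens ends up occupying its first k slots.
    return {((v - i * ell) % n, i) for v, k in counts.items() for i in range(1, k + 1)}

def algorithm_a1(x):
    """Do an ell-push for each ell in [[-3, 3]]; return the final slots and the loads y."""
    n = len(x)
    tokens = {(v, i) for v, load in enumerate(x) for i in range(1, load + 1)}
    for ell in range(-3, 4):
        tokens = push(tokens, n, ell)
    y = [0] * n
    for v, _ in tokens:
        y[v] += 1
    return tokens, y

x = [0, 4, 6, 6, 0, 6, 0, 4, 4, 2, 0, 3]
n = len(x)
tokens, y = algorithm_a1(x)
```

The order of the seven pushes is fixed here for concreteness; by Lemma 11 any other order would also produce a [[−3, 3]]-stable configuration, though possibly a different one.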

Finite directed paths and cycles. Algorithm A_1 finds a [[−3, 3]]-stable configuration in
infinite directed paths in time O(L). To handle finite directed paths we could extend the
algorithm and its analysis so that it takes into account the boundary effects. However, this
would be a bit boring; instead, we will show that we can simply take A_1 and use it as a black
box.

Let us first adjust the stability condition so that it makes sense on finite paths: a token (v, i)
is considered k-stable also if node v + k does not exist.

Let T_1 = O(L) be the worst-case running time of A_1. We use it to construct an algorithm
A_2 that finds a [[−3, 3]]-stable configuration for a finite path G, as follows:

1. Check if the path is of length at most 4T_1; if so, we solve the problem by brute force in
time O(L), and stop.

2. Each endpoint u gathers all tokens up to distance 2T_1 and redistributes them so that all
nodes within distance at most T_1 from u have the same constant load; let us denote this
constant c(u).

3. Construct a virtual graph G' as follows: each endpoint u pretends that the path continues
with infinitely many additional dummy nodes, each with the same constant load c(u).

4. Simulate algorithm A_1 in the virtual graph G'.

5. Discard the dummy nodes.

It is easy to verify that A_1 will never move any tokens across an endpoint, as its neighbourhood
was already well-balanced. Therefore if we remove the dummy nodes, we have a feasible solution
for G. Moreover, the running time of A_2 is still O(L).

It is also easy to see that A_2 works correctly in directed cycles; the first three steps simply
do nothing as there are no endpoints.


Undirected paths and cycles. So far we have designed an algorithm A_2 that finds a
[[−3, 3]]-stable configuration in paths and cycles with a globally consistent orientation. Now we
show how to use it to design an algorithm A_3 that finds a [[−1, 1]]-stable configuration in paths
and cycles without an orientation.

It can be shown that some form of local symmetry-breaking is needed. We will use the
familiar port-numbering model: each node v has up to two communication ports, labelled with
(v, 1) and (v, 2). The ports are identified with the endpoints of the edges; each edge joins a pair
of ports. The port numbers at the endpoints of an edge do not need to match; for example, an
edge {u, v} may join (u, 1) to (v, 1) or (u, 1) to (v, 2).

In algorithm A_3, we construct a virtual graph G' as shown in Figure 9: each node v splits
itself in two virtual nodes, v_1 and v_2. The virtual nodes also have two ports. For each edge
e = {u, v}, depending on the type of e we connect the virtual nodes of u and v as follows:

• e joins (u, 1) to (v, 1): connect (u_1, 1) to (v_2, 2) and (u_2, 2) to (v_1, 1),

• e joins (u, 1) to (v, 2): connect (u_1, 1) to (v_1, 2) and (u_2, 2) to (v_2, 1),

• e joins (u, 2) to (v, 1): connect (u_1, 2) to (v_1, 1) and (u_2, 1) to (v_2, 2),

• e joins (u, 2) to (v, 2): connect (u_1, 2) to (v_2, 1) and (u_2, 1) to (v_1, 2).

If G was a path with n nodes, then G' consists of two disjoint paths with n nodes each. If G was 
an n-cycle, then G' consists of either one cycle with 2n nodes or two cycles with n nodes each. 
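
The construction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names build_virtual_graph and components and the tuple encoding of ports are hypothetical, and the sketch assumes that copy v_1 reuses the physical port numbers of v while copy v_2 swaps ports 1 and 2, which is one consistent way to realise the connection rules.

```python
def build_virtual_graph(edges):
    """Build G' from a port-numbered path or cycle.

    edges: list of ((u, pu), (v, pv)), meaning port pu of u is joined to port pv of v.
    Returns a dict mapping a virtual port ((node, copy), port) to the virtual
    port at the other end of the same virtual edge.
    """
    def virtual_ends(node, port):
        # Assumption: copy 1 keeps the physical port number, copy 2 swaps 1 <-> 2.
        return {port: (node, 1), 3 - port: (node, 2)}

    link = {}
    for (u, pu), (v, pv) in edges:
        ends_u = virtual_ends(u, pu)  # virtual port number -> virtual copy of u
        ends_v = virtual_ends(v, pv)  # virtual port number -> virtual copy of v
        for q in (1, 2):
            a = (ends_u[q], q)
            b = (ends_v[3 - q], 3 - q)  # port q always meets port 3 - q
            link[a] = b
            link[b] = a
    return link


def components(link, nodes):
    """Return the connected components of G' as lists of virtual nodes."""
    seen, result = set(), []
    for start in [(v, c) for v in nodes for c in (1, 2)]:
        if start in seen:
            continue
        seen.add(start)
        comp, stack = [], [start]
        while stack:
            x = stack.pop()
            comp.append(x)
            for port in (1, 2):
                if (x, port) in link:
                    y = link[(x, port)][0]
                    if y not in seen:
                        seen.add(y)
                        stack.append(y)
        result.append(comp)
    return result
```

For a path with four nodes and an arbitrary port numbering, the sketch produces two virtual paths with four nodes each, and every virtual edge joins a port 1 to a port 2.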



Figure 9: Given any path G with some port numbering, we can construct a virtual graph G' 
that consists of two paths, both of which have a consistent port numbering. 

The key observation is that there is a consistent port numbering in G': port 1 of a virtual
node is always connected to port 2 of an adjacent virtual node. We can now interpret the ports
so that in each virtual node port 1 points “left” and port 2 points “right”.

Each node first splits its input load arbitrarily between its virtual copies. Then we run
algorithm A_2 to find a [[−3, 3]]-stable configuration in the virtual graph, and then map all tokens
back to the original graph: the new load of v is the sum of the new loads of v_1 and v_2; see
Figure 10.

Now we have a configuration where the maximum load difference between a pair of adjacent
nodes is 2. However, the load is approximately well-balanced: a load difference of more than 2
implies a distance of at least 4. Therefore we can easily find a [[−1, 1]]-stable configuration in
O(1) time with local operations (see Figure 10). For example, we can apply a match-and-balance
algorithm: find a maximal matching M of unhappy edges and move a token over each edge.
Conveniently, all edges become happy, including those that were not in M. It is easy to find a
maximal matching M in O(1) time, as this is in essence maximal matching in a bipartite graph
of maximum degree 2: on one side we have the nodes that are “too low” and on the other side
we have the nodes that are “too high” in comparison with their neighbours.

Figure 10: The sum of two [[−3, 3]]-stable configurations can be easily turned into a [[−1, 1]]-stable
configuration with local modifications.

In summary, we can find a [[−1, 1]]-stable configuration in any path or cycle in time O(L),
and therefore we can do discrete load balancing in any path or cycle in time O(L).
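
The match-and-balance step can be illustrated with a small centralised sketch on a path. The function name match_and_balance is hypothetical, the maximal matching is found by a simple left-to-right scan rather than by a distributed algorithm, and the input is assumed to satisfy the precondition stated above (loads of nodes within distance 3 differ by at most 2).

```python
def match_and_balance(load):
    """Return new loads along a path after one match-and-balance step.

    load: list of token counts, one per node, in path order.
    Assumption: nodes within distance 3 differ by at most 2 tokens.
    """
    load = list(load)
    # An edge {i, i+1} is unhappy if its endpoints differ by more than 1.
    unhappy = [i for i in range(len(load) - 1) if abs(load[i] - load[i + 1]) > 1]

    # Greedy scan: pick an unhappy edge whenever both endpoints are still free.
    # The result is a maximal matching of the unhappy edges.
    used, matching = set(), []
    for i in unhappy:
        if i not in used and i + 1 not in used:
            matching.append(i)
            used.update((i, i + 1))

    # Move one token from the higher endpoint to the lower endpoint.
    for i in matching:
        hi, lo = (i, i + 1) if load[i] > load[i + 1] else (i + 1, i)
        load[hi] -= 1
        load[lo] += 1
    return load
```

On the loads 2, 4, 2 only one of the two unhappy edges can be matched, yet the result 3, 3, 2 leaves both edges happy.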

4.2 Discrete load balancing in general graphs 

We will now show how to do discrete load balancing in graphs of maximum degree Δ.

Theorem 8. Discrete load balancing can be solved in time T(L, Δ), for some function T, in
bounded-degree graphs with a deterministic local algorithm.

Again, we will imagine that each node v has L slots, labelled (v, ·), and each token is placed
in one of the slots. Initially slots (v, 1), (v, 2), ..., (v, x(v)) are occupied with tokens.

We define the (downward) cone C(v, i) of slot (v, i) as the set of slots (u, j) ≠ (v, i) such
that i − j ≥ dist(v, u); see Figure 11. In the algorithm, if there is a token in (v, i) and all slots of
the cone C(v, i) are full, then we say that the token is stable, and we freeze it, i.e. it will never
be moved again.



Figure 11: The downward cone C(v, 4) of the token t = (v, 4) consists of the slots denoted with
white boxes.

In the algorithm we try to match the highest unfrozen tokens with the free slots in their
cones. If they succeed then they move to these slots; otherwise they can be frozen.

We now give the pseudo-code of the algorithm in a centralised way, prove the correctness of
the algorithm, and then show that it is actually a local algorithm. The algorithm proceeds as
follows:

1. All stable tokens of the initial configuration are frozen. 

2. For each h = L, L − 1, ..., 1:

(a) Construct the virtual bipartite graph F_h = (T ∪ S, E), where T consists of unfrozen
tokens at level h, S consists of all empty slots at levels below h, and there is an edge
{t, s} if s ∈ S is an empty slot in the cone of token t ∈ T.

(b) In F_h, find a maximal matching M.

(c) For every unfrozen token t at level h: if the token is matched with a slot s in M,
move the token to slot s, otherwise freeze it.

(d) Collapse the tokens so that for each node v that holds k tokens, the tokens are in the
slots (v, 1), (v, 2), ..., (v, k).

First, observe that we maintain the following invariant: in iteration h, every token at
level h either moves down or is safely frozen. Indeed, if a token is not matched, then all slots
in its cone are full at the end of the iteration, since M is maximal; if it is matched, it moves to
a strictly lower level. Hence after iteration h the invariant holds for level h, and it continues to
hold for all levels above h. At the end of the algorithm all tokens are frozen, and thus the
configuration is stable.
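
A centralised simulation of the steps above may look as follows. It is a sketch, not the distributed implementation: the names ball and balance_discrete are hypothetical, the maximal matching is found greedily, and frozen tokens are not tracked explicitly, on the assumption that a token whose cone is full simply finds no free slot and therefore stays in place.

```python
from collections import deque


def ball(adj, source, radius):
    """Return {node: distance} for all nodes within `radius` hops of `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if dist[v] == radius:
            continue
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


def balance_discrete(adj, x, L):
    """Centralised sketch of discrete load balancing with cones.

    adj: dict node -> list of neighbours; x: dict node -> initial load in 0..L.
    Returns dict node -> final load.
    """
    load = dict(x)
    for h in range(L, 0, -1):
        # After the collapse step, node v occupies exactly slots 1..load[v].
        filled = {v: set(range(1, load[v] + 1)) for v in adj}
        for v in adj:
            if load[v] < h:
                continue  # no token at level h in this node
            # Token t = (v, h): search its cone for an empty slot (u, j),
            # i.e. a slot with h - j >= dist(v, u).
            slot = next(
                ((u, j)
                 for u, d in ball(adj, v, h - 1).items()
                 for j in range(1, h - d + 1)
                 if j not in filled[u]),
                None,
            )
            if slot is not None:  # matched: move the token down
                filled[v].remove(h)
                filled[slot[0]].add(slot[1])
            # otherwise the token stays where it is (it is frozen)
        # Collapse: only the number of tokens per node matters.
        load = {v: len(filled[v]) for v in adj}
    return load
```

For example, on a path with loads 0, 3, 0 and L = 3 the sketch returns loads 1, 1, 1.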

We stated the algorithm in a centralised manner, but it is actually local: the nodes
only need to know their radius-L neighbourhood to find their neighbours in graph F_h.
Graph F_h has a maximum degree of O(LΔ^L). Therefore we can find a maximal matching in
F_h by simulating O(LΔ^L) rounds of the proposal algorithm [13] in the virtual graph F_h. The
simulation has a multiplicative O(L) overhead, as adjacent nodes in F_h are at distance O(L) in
graph G. Finally, we have O(L) iterations, giving the overall complexity of O(L^3 Δ^L).
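
The arithmetic behind this bound can be unpacked as follows. This is a sketch of the counting, assuming Δ ≥ 2 and using the fact that the cone of a token (v, h) only contains slots of nodes within distance h − 1 ≤ L − 1 of v, each of which has fewer than L slots below level h:

```latex
\begin{align*}
\deg_{F_h}(t)
  &\le L \cdot \bigl|\{u : \operatorname{dist}(v,u) \le L-1\}\bigr|
   \le L \sum_{d=0}^{L-1} \Delta^{d}
   \le L\,\Delta^{L}, \\
\text{total time}
  &= \underbrace{O(L\Delta^{L})}_{\text{matching rounds}}
     \cdot \underbrace{O(L)}_{\text{simulation overhead}}
     \cdot \underbrace{O(L)}_{\text{iterations}}
   = O(L^{3}\Delta^{L}).
\end{align*}
```

A slot has at most one token at level h per node within distance L − 1, so its degree in F_h is bounded by the same expression.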

4.3 Fractional load balancing in general graphs 

In fractional load balancing, we can use the same basic idea as in the discrete case,
but the resulting algorithm is much faster:

Theorem 7. Fractional load balancing can be solved in time poly(L, Δ) in bounded-degree graphs
with a deterministic local algorithm.

Our algorithm follows the same basic structure as the discrete algorithm of Section 4.2.
However, in each bipartite virtual graph F_h, we compute an ε-maximal fractional matching.
With the algorithm by Khuller et al. [18], this can be done in O(log ε^{-1} + log Δ) rounds, which
gives us an exponential speedup over the O(Δ)-round algorithm for maximal bipartite matching.

Almost maximal fractional matchings. Let the bipartite graph be G = (V, E) with
V = T ∪ S, and let Δ be the maximum degree. Each node has a capacity c: V → [0, 1]. A fractional
matching is a function y: E → [0, 1] such that for each node v, the sum y[v] of the values y(e)
over the edges e incident to v is at most c(v). A fractional matching is ε-maximal if

$$\max_{e \in E} \min\{\, c(v) - y[v] : v \in e \,\} \le \varepsilon,$$

that is, there is no edge e with a value y(e) that could be increased by more than ε.
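
The definition can be checked mechanically. The following sketch is only an illustration: the names node_sums and maximality_gap are hypothetical, edges are encoded as pairs (t, s), and exact rational arithmetic is used to avoid rounding issues. A fractional matching is ε-maximal exactly when the returned gap is at most ε.

```python
from fractions import Fraction


def node_sums(edges, y):
    """Return y[v], the total value of the edges incident to each node v."""
    total = {}
    for e in edges:
        for v in e:
            total[v] = total.get(v, Fraction(0)) + y[e]
    return total


def maximality_gap(edges, capacity, y):
    """Return the largest amount by which some single edge could still grow.

    Raises ValueError if y violates a capacity constraint.
    """
    sums = node_sums(edges, y)
    if any(sums[v] > capacity[v] for v in sums):
        raise ValueError("y is not a fractional matching: capacity exceeded")
    # An edge can grow by the smaller of the residual capacities of its endpoints.
    return max(min(capacity[v] - sums[v] for v in e) for e in edges)
```

For instance, with one token node t of capacity 1 and two slot nodes of capacities 1/2 and 3/10, the values y(t, s1) = 1/2 and y(t, s2) = 1/5 leave a gap of 1/10 on the second edge.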

Algorithm for fractional load balancing. Next we describe the algorithm for finding a
fractional load balancing. As before, we have slots labelled with (v, i); here v is a node and i is
the level of the slot. However, now each slot may contain fractional units of load. We adapt the
definition of stability to fractional load balancing in a natural manner: we say that α units of
the load in slot (v, i) is stable if each (u, j) with i − j = d(v, u) has at least α units of load, and
each (u, j) with i − j > d(v, u) is full. In the algorithm we will freeze some parts of the load.
We use ℓ_i(v) to denote the total amount of load in slot (v, i), and f_i(v) to denote the amount of
frozen load in slot (v, i).

The algorithm would be simpler to analyse if we had a maximal fractional matching algorithm,
but we only have an efficient ε-maximal one. Our strategy is to round the load; the rounding
accumulates surplus and deficit, but it makes the iterations of the algorithm easy to analyse.
We keep track of surplus and deficit, and at the end of the algorithm we readjust the loads.

The algorithm for fractional load balancing works as follows. First, double all load; this
way we have 2L slots per node. Based on the new input, freeze all stable load. Then, for each
h = 2L, 2L − 1, ..., 1, perform the following steps:

1. Construct the virtual bipartite graph F_h = (T ∪ S, E), where T consists of slots with
unfrozen load at level h, S consists of all slots at levels below h, and there is an edge {t, s}
if s ∈ S is a non-full slot in the cone of slot t ∈ T.

2. Define the capacities of F_h as follows: the capacity of a node (t, h) ∈ T is its unfrozen load
c(t, h) = ℓ_h(t) − f_h(t), and the capacity of a node (s, i) ∈ S is its free space c(s, i) = 1 − ℓ_i(s).

3. Find an ε-maximal fractional matching y in F_h.

4. For each edge e = {t, s} of F_h, move y(e) units of load from slot t to slot s.

5. For each t consider the two cases:

• If t has load at most ε, round it to 0 and freeze the load in t. We create at most ε
units of deficit in slot t at the moment we freeze it.

• If t has strictly more than ε load, then each unfrozen slot in the cone of t has at least
1 − ε load. We round them to 1 and freeze all load at t and in C(t). We create at
most ε units of surplus in each slot of C(t) at the moment we freeze it.

6. Collapse all unfrozen load as low as possible.

Finally, undo the rounding—remove the surplus and put back the deficit. Then normalise the 
output by dividing all load by two. 

Analysis. Thanks to the rounding, the configuration after each iteration satisfies the same
invariant as the discrete algorithm: in iteration h, all load in slots at height h either moves down
or is safely frozen. Hence the configuration at the end of the iteration phase is stable, and for
each edge {u, v} we have |y(u) − y(v)| ≤ 1.

Each slot is involved at most once in the rounding, precisely at the moment we freeze it.
Hence before normalisation, the difference between the real load and the rounded load is in
[−2Lε, 2Lε] for each node. Therefore we have |y(u) − y(v)| ≤ 1 + 4Lε for each edge {u, v}
before normalisation and |y(u) − y(v)| ≤ 1/2 + 2Lε after normalisation. We guarantee a feasible
solution by choosing ε ≤ 1/(4L).

Khuller et al. [18] show how to find an ε-maximal fractional matching in time O(log ε^{-1} + log d)
in graphs of maximum degree d. The virtual graph F_h has a maximum degree of d = O(LΔ^L),
and there is an O(L) overhead in the simulation of F_h in G. Finally, we have O(L) iterations in the
algorithm; in total, the running time can be bounded by

$$O\bigl(L^{2}(\log \varepsilon^{-1} + \log d)\bigr) = O(L^{3} \log \Delta).$$
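
The last equality can be verified term by term. The following is a sketch of the arithmetic, assuming Δ ≥ 2 and the choice ε = 1/(4L) from the analysis above:

```latex
\begin{align*}
\log \varepsilon^{-1} &= \log(4L) = O(\log L), \\
\log d &= \log O(L\Delta^{L}) = O(\log L + L \log \Delta), \\
O\bigl(L^{2}(\log \varepsilon^{-1} + \log d)\bigr)
  &= O\bigl(L^{2}\log L + L^{3}\log \Delta\bigr)
   = O(L^{3}\log \Delta).
\end{align*}
```

The final step uses log L ≤ L log Δ, which holds whenever Δ ≥ 2.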

This completes the proof of Theorem 7. 





5 Conclusions 


In this work, we have introduced the problem of finding a locally optimal load balancing, and
studied its distributed time complexity. We have shown that the problem can be solved in a
strictly local fashion, but to do it, one has to resort to algorithms that are very different from
typical load-balancing strategies that are used in the literature. Among the key findings are:

• an O(L)-time algorithm for discrete load balancing in paths and cycles,

• a poly(L, Δ)-time algorithm for fractional load balancing in graphs of maximum degree Δ.

The main open question is the distributed time complexity of the discrete load balancing
problem. Our algorithm is local, but it has a running time exponential in L; the key question
is whether poly(L, Δ)-time algorithms exist. We suspect that it is related to another
long-standing open question: the distributed time complexity of bipartite maximal matching. Indeed,
a polylog(Δ)-time algorithm for bipartite maximal matching would imply a poly(L, Δ)-time
algorithm for discrete load balancing. We conjecture that such algorithms do not exist, but
proving such lower bounds seems to be still beyond the reach of current techniques.

Another open question is the generalisation of the results from the LOCAL model to the 
CONGEST model. In particular, the polynomial-time algorithm for fractional load balancing 
heavily abuses the unlimited bandwidth of the LOCAL model, but it seems that there are no 
major obstacles for designing an analogous algorithm that works efficiently in the CONGEST 
model. 